Spatio-Temporal Information Production and Consumption of Major U.S. Research Institutions
Abstract
This paper reports the results of a large-scale data analysis that aims to identify information production and consumption among top research institutions in the United States. A 20-year publication data set was analyzed to identify the 500 most cited research institutions and spatio-temporal changes in their inter-citation patterns. A novel approach is introduced for analyzing the dual role of institutions as information producers and consumers and for studying the diffusion of information among them. A geographic visualization metaphor is used to visually depict the production and consumption of knowledge. Surprisingly, the introduction of the Internet does not seem to affect the distance over which information diffuses as manifested by citation links. The citation linkages between institutions fall off with the distance between them, and there is a strong linear relationship between the log of the citation counts and the log of the distance. The paper concludes with a discussion of these results and an outlook for future work.

Introduction

Does space still matter in the Internet age? Does one still have to study and work at major research institutions in order to have access to high quality data and expertise and to produce high quality research? To answer these questions, an interdisciplinary publication data set covering the years from 1982 to 2001 was analyzed to identify the 500 most cited research institutions in the United States and spatial changes in their inter-citation patterns. Advanced data analysis and visualization techniques were applied to determine information sources and sinks and the diffusion patterns among them. The results of our analysis are surprising in that the increasing usage of the Internet does not lead to more global citation patterns. In particular, the distance over which information diffuses as manifested by citation links does not increase over time.
The remainder of the paper is organized as follows: Section 2 reviews related work and contrasts it with our approach; Section 3 describes the data set used in this analysis and how it was processed; Section 4 presents visualizations of the data set; Section 5 concludes the paper with a discussion of results and future work.

This research is supported by Career grant no. IIS-0238261 from the National Science Foundation awarded to the first author. We gratefully acknowledge support from the Center for the Study of Institutions, Population, and Environmental Change at Indiana University through National Science Foundation grant BCS-0215738.

Börner, Katy & Penumarthy, Shashikant. (in press) Spatio-Temporal Information Production and Consumption of Major U.S. Research Institutions. Accepted at the 10th International Conference of the International Society for Scientometrics and Informetrics, Stockholm, Sweden, July 24-28.

Related Work and Our Approach

The diffusion of tangible objects (people, goods, etc.) and of intangible objects (ideas, activity levels, etc.) has been studied in diverse fields of science, including physics, e.g., heat diffusion; robotics, e.g., communication among mobile robots (Arai, Yoshida et al. 1993); social network analysis (Granovetter 1973; 2002); bibliometrics/scientometrics/webometrics (Katz 1994; Thelwall 2002); geography, e.g., migration studies (Ravenstein 1885; Thornthwaite 1934; Tobler 1995); and biology, e.g., neuronal migration in the nervous system (Thurner, Wick et al. 2002). Other studies have attempted to judge the research vitality or quality of research conducted at specific research institutions. Diverse activity, impact, and linkage measures exist and can be applied to quantify the research contribution of institutions (Narin, Olivastro et al. 1994). However, very few citation studies have attempted to analyze the geographical concentration of highly cited authors, institutions, and countries.
Batty’s (2003) work is an exception: it shows that the distribution of citation counts is highly skewed, with most citations being associated with a few individuals working at a small number of institutions in an even smaller number of places and countries.

Here, we are interested in studying the diffusion of scholarly knowledge. We assume that scholarly knowledge diffuses via co-authorships, the physical movement of authors through geographical space, and the production (writing) and consumption (citing) of papers, among others. Unfortunately, the identification of unique author names remains an unresolved problem. Similarly, proper attribution of an author to his or her institution is often impossible due to the quality of available publication data.

Our work goes beyond existing research in that we do not only examine the citation counts for each institution but attempt to (1) identify geographically and statistically significant instances of institutions that act as major information sources, (2) correlate their behavior as information sources (number of citations their papers receive), information sinks (number of references to papers produced at other institutions), and self-consumers (number of self citations), (3) use direct citation linkage to identify their interrelation based on the amount of directly exchanged information, and (4) analyze and visualize the importance of proximity in geographic space for information exchange.

In what follows, we formalize each institution as a node that acts both as a source (or producer) of information and as an information sink (or consumer). Arrows among institutions denote the flow of information. If a paper was published at institution A and is cited by a paper that is published at institution B, then there will be an arrow going from A to B. The more papers produced at A are cited by B, the higher the volume of information flow.
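The arrows described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names `build_flows` and `summarize`, the input format (one pair of cited and citing institution per paper-to-paper citation), and the choice to keep self-citations separate from the flows between distinct institutions are all hypothetical.

```python
from collections import Counter


def build_flows(citations):
    """Count information flow per directed institution pair.

    `citations` is an iterable of (cited_institution, citing_institution)
    pairs, one per paper-to-paper citation (an assumed input format).
    The arrow points from the cited institution to the citing one.
    """
    flows = Counter()
    for cited, citing in citations:
        flows[(cited, citing)] += 1
    return flows


def summarize(flows):
    """Split each institution's flow into its three roles.

    Returns three Counters: citations received from other institutions
    (source role), references made to other institutions (sink role),
    and self-citations (arrows from an institution to itself).
    Keeping self-citations separate is an assumption of this sketch.
    """
    produced = Counter()
    consumed = Counter()
    self_cited = Counter()
    for (source, sink), volume in flows.items():
        if source == sink:
            self_cited[source] += volume
        else:
            produced[source] += volume
            consumed[sink] += volume
    return produced, consumed, self_cited


# Hypothetical toy input: two papers from A cited at B, one from B cited
# at A, and one self-citation within A.
flows = build_flows([("A", "B"), ("A", "B"), ("B", "A"), ("A", "A")])
produced, consumed, self_cited = summarize(flows)
print(flows[("A", "B")], produced["A"], consumed["A"], self_cited["A"])
```

The counts here are raw volumes; any normalization would have to follow whatever scheme the analysis actually uses, which the text does not specify.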
Hence, the normalized out-degree of a node can be used to characterize the role of an institution as an information source. The normalized in-degree of a node describes the role of an institution as an information sink. Links that lead from an institution to itself correspond to self-citations. Note that this formalization could also be applied to authors, countries, etc.

Data Set and Data Analysis

The complete set of papers published in the Proceedings of the National Academy of Sciences (PNAS) in the years from 1982 to 2001 was analyzed to determine knowledge diffusion pathways among major institutions as manifested in citation linkages among the papers. The data set contains 47,073 papers published by 18,994 unique authors, who work at 2,822 institutions. Institutions comprise academic institutions, research labs, and corporate entities. To be credited with an article, a given institution had to be the site of the first author listed on the paper. The paper most highly cited by papers within the set received 612 citations.

Given our interest in exploring the importance of spatial proximity for the diffusion of information within the U.S., we decided to analyze information diffusion patterns among major institutions, the spatial position of which is uniquely and persistently identified by their zip code and corresponding longitude and latitude coordinates. By ‘major institutions’, we refer to institutions that have acquired the highest total number of citations for their papers.

An initial data cleaning step was performed to remove suffixes such as INC, MED.
These suffixes serve to indicate whether the entity in question is a corporate entity, a research lab, or an academic institution. However, the spacing between the name of the institution and the suffix is not consistent, which leads to string matching problems. Removing the suffixes helps to make institution names uniform across the data set.

Next, we had to decide which institutions should be merged. For example, an institution such as Indiana University has several campuses. Collapsing all these campuses into one entity causes valuable geographic information to be lost, since the campuses might be far apart. However, treating each campus separately can result in extremely cluttered data. Separating the campuses of one university also splits its citations among those campuses. For example, Indiana University as a single entity might qualify for the list of the 500 most highly cited institutions, but when the campuses are split, none of the individual campuses might have the requisite number of citations to make it into this list.

The zip code was used to preserve information about where two institutions with the same name, but with differing geographic locations, are located. The United States zip code system assigns postal codes based on the position of a geographic location in a hierarchy of areas. In the 5-digit zip code, the first digit indicates which region of the U.S. the location belongs to, such as northeast, southwest, etc. The next two digits indicate state and county information. The final two digits serve to distinguish finer boundaries such as towns and cities within a county. A unique ID was created for each institution by concatenating the (abbreviated) name of the institution with its zip code. As this system is unique to the United States, non-U.S. institutions, such as University of Tokyo (1,797 citations), were excluded from the analysis presented in this paper despite producing highly cited publications.

We then proceeded to determine the level of geographic resolution that is significant for answering our question. Given that universities typically do not have two major campuses in one county, we decided to use the county as our smallest unit. Hence, for each institution, all its campuses or instances that lay within the same county were collapsed into one entity. In zip code terms, this meant merging all instances of an institution whose zip codes differed only in the last two digits. The newly created identity of the institution consisted of a concatenation of the (abbreviated) name with the smallest zip code within that county. For example, INDIANA UNIV47401 and INDIANA UNIV47405 were collapsed into INDIANA UNIV47401. Collapsing universities in this manner provides a good compromise between maintaining geographic identity and statistical significance.

Subsequently, the 500 most highly cited institutions were identified. These 500 institutions produced 30,572 (64.95%) of all papers and received 195,889 (51.83%) of a total of 377,935 citations. A graph showing the number of listed references, received citations, and self citations over the alphabetically sorted list of institutions is given in Figure 1. An offset was applied to citation counts to improve readability. Exactly five institutions produced papers that attracted more than 4,000 citations. Harvard (HARVARD UNIV02114) leads with 13,763 citations. MIT (MIT02139) follows with 5,261. Johns Hopkins University (JOHNS HOPKINS UNIV21201) has 4,848. Stanford (STANFORD UNIV94302) accumulated 4,546 and UCSF (UNIV CALIF SAN FRANCISCO94103) received 4,471.

For each institution, we determined the ratio of the number of citations received by this institution to the sum of received citations and references made, multiplied by 100.
Interestingly, there are 131 institutions with a value between 0-40% acting mostly as information producers. 71 of the institutions have a value between 60-100% and act mostly as information consumers: they reference a large number of papers, but the number of citations they receive is comparatively low.
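Two of the processing steps described in this section, the county-level merging of institution IDs and the share of received citations per institution, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the input formats, and the handling of the 40% and 60% boundaries are hypothetical, and the sample records other than the two Indiana University IDs quoted in the text are made up.

```python
from collections import defaultdict


def collapse_by_county(records):
    """Merge instances of one institution whose zip codes share the
    first three digits (i.e. differ only in the last two).

    `records` is an iterable of (name, five_digit_zip) pairs (assumed
    format). Returns a dict mapping each pair to its merged ID: the
    name concatenated with the smallest zip code in the group.
    """
    groups = defaultdict(list)
    for name, zip_code in records:
        groups[(name, zip_code[:3])].append(zip_code)

    merged = {}
    for (name, _prefix), zips in groups.items():
        smallest = min(zips)  # equal-length digit strings sort numerically
        for zip_code in zips:
            merged[(name, zip_code)] = f"{name}{smallest}"
    return merged


def received_share(citations_received, references_made):
    """Citations received divided by the sum of received citations and
    references made, multiplied by 100. Returns None for an empty sum."""
    total = citations_received + references_made
    if total == 0:
        return None
    return 100.0 * citations_received / total


def band(share):
    """Place a share into one of three ranges. Assigning exactly 40 and
    exactly 60 to the middle range is an assumption of this sketch."""
    if share < 40:
        return "0-40%"
    if share <= 60:
        return "40-60%"
    return "60-100%"


# The first two records come from the example in the text; the third is
# a hypothetical instance in a different zip code area.
ids = collapse_by_county([
    ("INDIANA UNIV", "47401"),
    ("INDIANA UNIV", "47405"),
    ("INDIANA UNIV", "46202"),
])
print(ids[("INDIANA UNIV", "47405")])
print(band(received_share(30, 70)))
```

Grouping on the three-digit prefix follows the rule stated in the text; a production pipeline would probably use a zip-to-county lookup instead, since the prefix only approximates county boundaries.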